48 research outputs found
Towards The Efficient Use Of Fine-Grained Provenance In Datascience Applications
Recent years have witnessed increased demand for users to be able to interpret the results of data science pipelines, locate erroneous data items in the input, evaluate the importance of individual input data items, and acknowledge the contributions of data curators. Such applications often involve the use of provenance at a fine-grained level and require very fast response times. To address this need, my goal is to expedite the use of fine-grained provenance in applications from both the database and machine learning (ML) domains, which are ubiquitous in contemporary data science pipelines. In applications from the database domain, I focus on the problem of data citation and provide two types of solutions, rewriting-based and provenance-based, which generate fine-grained citations to database query results by implicitly or explicitly leveraging provenance information. The first application from the ML domain considers the problem of incrementally updating ML models after the deletion of a small subset of training samples. This is critical for understanding the importance of individual training samples to ML models, especially in online pipelines. For this problem, I provide two solutions, PrIU and DeltaGrad, to incrementally update ML models constructed by SGD/GD methods; both utilize provenance information collected during the training phase on the full dataset before the deletion requests. The second application from the ML domain explores how to clean uncertain labels in ML training datasets more efficiently and cheaply. To address this problem, I propose a solution, CHEF, which reduces the cost and overhead at each phase of the label cleaning pipeline while maintaining overall model performance. I also propose initial ideas for removing some of the assumptions used in these solutions in order to extend them to more general scenarios.
MDB: Interactively Querying Datasets and Models
As models are trained and deployed, developers need to be able to
systematically debug errors that emerge in the machine learning pipeline. We
present MDB, a debugging framework for interactively querying datasets and
models. MDB integrates functional programming with relational algebra to build
expressive queries over a database of datasets and model predictions. Queries
are reusable and easily modified, enabling debuggers to rapidly iterate and
refine queries to discover and characterize errors and model behaviors. We
evaluate MDB on object detection, bias discovery, image classification, and
data imputation tasks across self-driving videos, large language models, and
medical records. Our experiments show that MDB enables up to 10x faster and
40% shorter queries than other baselines. In a user study, we find developers
can successfully construct complex queries that describe errors of machine
learning models.
Dynamic Gaussian Mixture based Deep Generative Model For Robust Forecasting on Sparse Multivariate Time Series
Forecasting on sparse multivariate time series (MTS) aims to model the
predictors of future values of time series given their incomplete past, which
is important for many emerging applications. However, most existing methods
process MTS's individually, and do not leverage the dynamic distributions
underlying the MTS's, leading to sub-optimal results when the sparsity is high.
To address this challenge, we propose a novel generative model, which tracks
the transition of latent clusters, instead of isolated feature representations,
to achieve robust modeling. It is characterized by a newly designed dynamic
Gaussian mixture distribution, which captures the dynamics of clustering
structures, and is used for emitting time series. The generative model is
parameterized by neural networks. A structured inference network is also
designed for enabling inductive analysis. A gating mechanism is further
introduced to dynamically tune the Gaussian mixture distributions. Extensive
experimental results on a variety of real-life datasets demonstrate the
effectiveness of our method. Comment: This paper is accepted by AAAI 202
Identification of Prognostic Genes and Pathways in Lung Adenocarcinoma Using a Bayesian Approach
Lung cancer is the leading cause of cancer-associated mortality in the United States and the world. Adenocarcinoma, the most common subtype of lung cancer, is generally diagnosed at a late stage with poor prognosis. In the past, extensive effort has been devoted to elucidating lung cancer pathogenesis and pinpointing genes associated with survival outcomes. As the progression of lung cancer is a complex process that involves coordinated actions of functionally associated genes from cancer-related pathways, there is growing interest in the simultaneous identification of both prognostic pathways and important genes within those pathways. In this study, we analyse The Cancer Genome Atlas lung adenocarcinoma data using a Bayesian approach incorporating the pathway information as well as the interconnections among genes. The top 11 pathways have been found to play significant roles in lung adenocarcinoma prognosis, including pathways in mitogen-activated protein kinase signalling, cytokine-cytokine receptor interaction, and ubiquitin-mediated proteolysis. We have also located key gene signatures such as RELB, MAP4K1, and UBE2C. These results indicate that the Bayesian approach may facilitate discovery of important genes and pathways that are tightly associated with the survival of patients with lung adenocarcinoma.
Influence of turbid flood water release on sediment deposition and phosphorus distribution in the bed sediment of the Three Gorges Reservoir, China
Excessive phosphorus (P) loading was identified as an urgent problem during the post-Three Gorges Reservoir (TGR) period. Turbid water with high suspended sediment loads has been periodically released during the flood season to mitigate sediment deposition in the TGR, but limited attention has been paid to its effect on the distribution of P in bed sediment within the reservoir. In this study, field surveys, historical monitoring data on sediment deposition, and the physicochemical properties and fractional P content of the mainstream surface sediment and representative column sediment were used to investigate the effect of turbid flood water release on P distribution in bed sediment. The results revealed that turbid flood water release could discharge approximately 20% of the suspended sediment inflow entering the TGR. Additionally, both the particle size of the inflow sediment and the suspended sediment flux tended to decline, while the deposited sediment volume in the TGR increased steadily at a rate of 0.117 billion tonnes per year between 2004 and 2016. The median particle size (MPS) was larger for surface sediment obtained in the flood season than for that obtained in the dry season, and the MPS tended to increase as the sediment depth increased from 0 to 20 cm. The total phosphorus (TP) content in sediment was 2.6% to 17.5% lower in the flood water releasing period than in the non-flood water storing period. However, no consistent variation was detected in the vertical distribution of P fractions in the top 20 cm of bed sediment. Compared with lakes with slow deposition rates, the TGR showed a rapid sedimentation rate of >1.0 m/y, which mostly resulted in the uniform distribution of the surface sediment P fraction.